{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e541c8c8",
   "metadata": {
    "id": "e541c8c8"
   },
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
    "        <br>\n",
    "        <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n",
    "        |\n",
    "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "        |\n",
    "        <a href=\"https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q\">Community</a>\n",
    "    </p>\n",
    "</center>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c2c8e6f",
   "metadata": {
    "id": "1c2c8e6f"
   },
   "source": [
    "# Trace-Level Evals for a Movie Recommendation Agent\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a68d55ab",
   "metadata": {
    "id": "a68d55ab"
   },
   "source": [
    "This notebook demonstrates how to run trace-level evaluations for a movie recommendation agent. By analyzing individual traces, each representing a single user request, you can gain insights into how well the system is performing on a per-interaction basis. Trace-level evaluations are particularly valuable for identifying successes and failures for end-to-end performance.\n",
    "\n",
    "In this notebook, you will:\n",
    "\n",
    "- Build and capture interactions (traces) from your movie recommendation agent\n",
    "- Evaluate each trace across key dimensions such as Recommendation Relevance and Tool Usage\n",
    "- Format the evaluation outputs to match Arize’s schema and log them to the platform\n",
    "- Learn a robust pipeline for assessing trace-level performance\n",
    "\n",
    "✅ You will need a free [Phoenix Cloud account](https://app.arize.com/auth/phoenix/login) and an OpenAI API key to run this notebook.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "549464f8",
   "metadata": {
    "id": "549464f8"
   },
   "source": [
    "# Set Up Keys & Dependencies\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b425a65",
   "metadata": {
    "collapsed": true,
    "id": "5b425a65"
   },
   "outputs": [],
   "source": [
    "%pip install openinference-instrumentation-openai openinference-instrumentation-openai-agents openinference-instrumentation arize-phoenix arize-phoenix-otel nest_asyncio openai openai-agents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15652859",
   "metadata": {
    "id": "15652859"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "if not (phoenix_endpoint := os.getenv(\"PHOENIX_COLLECTOR_ENDPOINT\")):\n",
    "    phoenix_endpoint = getpass(\"🔑 Enter your Phoenix Collector Endpoint: \")\n",
    "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = phoenix_endpoint\n",
    "\n",
    "\n",
    "if not (phoenix_api_key := os.getenv(\"PHOENIX_API_KEY\")):\n",
    "    phoenix_api_key = getpass(\"🔑 Enter your Phoenix API key: \")\n",
    "os.environ[\"PHOENIX_API_KEY\"] = phoenix_api_key\n",
    "\n",
    "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
    "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f47e0638",
   "metadata": {
    "id": "f47e0638"
   },
   "source": [
    "# Configure Tracing\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36dbc00a43257651",
   "metadata": {
    "id": "36dbc00a43257651"
   },
   "outputs": [],
   "source": [
    "from phoenix.otel import register\n",
    "\n",
    "# configure the Phoenix tracer\n",
    "tracer_provider = register(project_name=\"movie-rec-agent\", auto_instrument=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c81c9bb",
   "metadata": {
    "id": "7c81c9bb"
   },
   "source": [
    "First, we need to define the tools that our recommendation system will use. For this example, we will define 3 tools:\n",
    "\n",
    "1. Movie Selector: Based on the desired genre indicated by the user, choose up to 5 recent movies availabtle for streaming\n",
    "2. Reviewer: Find reviews for a movie. If given a list of movies, sort movies in order of highest to lowest ratings.\n",
    "3. Preview Summarizer: For each movie, return a 1-2 sentence description\n",
    "\n",
    "Our most ideal flow involves a user simply giving the system a type of movie they are looking for, and in return, the user gets a list of options returned with descriptions and reviews.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c64844a7",
   "metadata": {
    "id": "c64844a7"
   },
   "source": [
    "Let's test our agent & view traces in Arize\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1731db8",
   "metadata": {
    "id": "f1731db8"
   },
   "outputs": [],
   "source": [
    "import ast\n",
    "from typing import List, Union\n",
    "\n",
    "from agents import Agent, Runner, function_tool\n",
    "from openai import OpenAI\n",
    "from opentelemetry import trace\n",
    "\n",
    "tracer = trace.get_tracer(__name__)\n",
    "\n",
    "client = OpenAI()\n",
    "\n",
    "\n",
    "@function_tool\n",
    "def movie_selector_llm(genre: str) -> List[str]:\n",
    "    prompt = (\n",
    "        f\"List up to 5 recent popular streaming movies in the {genre} genre. \"\n",
    "        \"Provide only movie titles as a Python list of strings.\"\n",
    "    )\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-4\",\n",
    "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "        temperature=0.7,\n",
    "        max_tokens=150,\n",
    "    )\n",
    "    content = response.choices[0].message.content\n",
    "    try:\n",
    "        movie_list = ast.literal_eval(content)\n",
    "        if isinstance(movie_list, list):\n",
    "            return movie_list[:5]\n",
    "    except Exception:\n",
    "        return content.split(\"\\n\")\n",
    "\n",
    "\n",
    "@function_tool\n",
    "def reviewer_llm(movies: Union[str, List[str]]) -> str:\n",
    "    if isinstance(movies, list):\n",
    "        movies_str = \", \".join(movies)\n",
    "        prompt = f\"Sort the following movies by rating from highest to lowest and provide a short review for each:\\n{movies_str}\"\n",
    "    else:\n",
    "        prompt = f\"Provide a short review and rating for the movie: {movies}\"\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-4\",\n",
    "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "        temperature=0.7,\n",
    "        max_tokens=300,\n",
    "    )\n",
    "    return response.choices[0].message.content.strip()\n",
    "\n",
    "\n",
    "@function_tool\n",
    "def preview_summarizer_llm(movie: str) -> str:\n",
    "    prompt = f\"Write a 1-2 sentence summary describing the movie '{movie}'.\"\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-4\",\n",
    "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "        temperature=0.7,\n",
    "        max_tokens=100,\n",
    "    )\n",
    "    return response.choices[0].message.content.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ee91369",
   "metadata": {
    "id": "8ee91369"
   },
   "outputs": [],
   "source": [
    "agent = Agent(\n",
    "    name=\"MovieRecommendationAgentLLM\",\n",
    "    tools=[movie_selector_llm, reviewer_llm, preview_summarizer_llm],\n",
    "    instructions=(\n",
    "        \"You are a helpful movie recommendation assistant with access to three tools:\\n\"\n",
    "        \"1. MovieSelector: Given a genre, returns up to 5 recent streaming movies.\\n\"\n",
    "        \"2. Reviewer: Given one or more movie titles, returns reviews and sorts them by rating.\\n\"\n",
    "        \"3. PreviewSummarizer: Given a movie title, returns a 1-2 sentence summary.\\n\\n\"\n",
    "        \"Your goal is to provide a helpful, user-friendly response combining relevant information.\"\n",
    "    ),\n",
    ")\n",
    "\n",
    "\n",
    "async def main():\n",
    "    user_input = \"Which comedy movie should I watch?\"\n",
    "    result = await Runner.run(agent, user_input)\n",
    "    print(result.final_output)\n",
    "\n",
    "\n",
    "await main()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6bef5b97",
   "metadata": {
    "id": "6bef5b97"
   },
   "source": [
    "Next, we’ll run the agent a few more times to generate additional traces. Feel free to adapt or customize the questions as you see fit.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cf62bca",
   "metadata": {
    "id": "5cf62bca"
   },
   "outputs": [],
   "source": [
    "questions = [\n",
    "    \"Which Batman movie should I watch?\",\n",
    "    \"I want to watch a good romcom\",\n",
    "    \"What is a very scary horror movie?\",\n",
    "    \"Name a feel-good holiday movie\",\n",
    "    \"Recommend a musical with great songs\",\n",
    "    \"Give me a classic drama from the 90s\",\n",
    "]\n",
    "\n",
    "for question in questions:\n",
    "    result = await Runner.run(agent, question)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f819c8ee",
   "metadata": {
    "id": "f819c8ee"
   },
   "source": [
    "# Get Span Data from Phoenix\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0de0363e",
   "metadata": {
    "id": "0de0363e"
   },
   "source": [
    "Before running our evaluations, we first retrieve the span data from Arize. We then group the spans by trace and separate the input and output values.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "981913a80fa0a780",
   "metadata": {
    "id": "981913a80fa0a780"
   },
   "outputs": [],
   "source": [
    "from phoenix.client import AsyncClient\n",
    "\n",
    "px_client = AsyncClient()\n",
    "primary_df = await px_client.spans.get_spans_dataframe(project_identifier=\"movie-rec-agent\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91b78eee",
   "metadata": {
    "id": "91b78eee"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "trace_df = primary_df.groupby(\"context.trace_id\").agg(\n",
    "    {\n",
    "        \"attributes.input.value\": \"first\",\n",
    "        \"attributes.output.value\": lambda x: \" \".join(x.dropna()),\n",
    "    }\n",
    ")\n",
    "trace_df = trace_df.rename(\n",
    "    columns={\n",
    "        \"attributes.input.value\": \"input\",\n",
    "        \"attributes.output.value\": \"output\",\n",
    "    }\n",
    ")\n",
    "trace_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d014913f",
   "metadata": {
    "id": "d014913f"
   },
   "source": [
    "# Define and Run Evaluators\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a426e4ca",
   "metadata": {
    "id": "a426e4ca"
   },
   "source": [
    "In this tutorial, we will evaluate two aspects: tool usage and relevance. You can add any additional evaluation templates you like. We will then run the evaluations using an LLM as the judge.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0f1768d",
   "metadata": {
    "id": "d0f1768d"
   },
   "outputs": [],
   "source": [
    "TOOL_CALLING_ORDER = \"\"\"\n",
    "You are evaluating the correctness of the tool calling order in an LLM application's trace.\n",
    "\n",
    "You will be given:\n",
    "1. The user input that initiated the trace\n",
    "2. The full trace output, including the sequence of tool calls made by the agent\n",
    "\n",
    "##\n",
    "User Input:\n",
    "{input}\n",
    "\n",
    "Trace Output:\n",
    "{output}\n",
    "##\n",
    "\n",
    "Respond with exactly one word: `correct` or `incorrect`.\n",
    "1. `correct` →\n",
    "- The tool calls occur in the appropriate order to fulfill the user's request logically and effectively.\n",
    "- A proper answer involves calls to reviews, summaries, and recommendations where relevant.\n",
    "2. `incorrect` → The tool calls are out of order, missing, or do not follow a coherent sequence for the given input.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "131a5e7f",
   "metadata": {
    "id": "131a5e7f"
   },
   "outputs": [],
   "source": [
    "RECOMMENDATION_RELEVANCE = \"\"\"\n",
    "You are evaluating the relevance of movie recommendations provided by an LLM application.\n",
    "\n",
    "You will be given:\n",
    "1. The user input that initiated the trace\n",
    "2. The list of movie recommendations output by the system\n",
    "\n",
    "##\n",
    "User Input:\n",
    "{input}\n",
    "\n",
    "Recommendations:\n",
    "{output}\n",
    "##\n",
    "\n",
    "Respond with exactly one word: `correct` or `incorrect`.\n",
    "1. `correct` →\n",
    "- All recommended movies match the requested genre or criteria in the user input.\n",
    "- The recommendations should be relevant to the user's request and shouldn't be repetitive.\n",
    "- `incorrect` → one or more recommendations do not match the requested genre or criteria.\n",
    "\"\"\""
   ]
  },
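  {
   "cell_type": "markdown",
   "id": "optional-extra-template-md",
   "metadata": {
    "id": "optional-extra-template-md"
   },
   "source": [
    "As an optional illustration of how you could extend the evaluation, the next cell sketches a third template for response clarity. `RESPONSE_CLARITY` is a name introduced here for the example, not part of the original notebook; to use it, create a classifier from it (for example, `create_classifier(name=\"clarity\", llm=llm, prompt_template=RESPONSE_CLARITY, choices={\"clear\": 1.0, \"unclear\": 0.0})`) and add that evaluator to the `evaluators` list in the evaluation cell below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "optional-extra-template-code",
   "metadata": {
    "id": "optional-extra-template-code"
   },
   "outputs": [],
   "source": [
    "# Optional: an additional trace-level evaluation template (illustrative sketch, not used below by default)\n",
    "RESPONSE_CLARITY = \"\"\"\n",
    "You are evaluating the clarity of the final response produced by an LLM application.\n",
    "\n",
    "You will be given:\n",
    "1. The user input that initiated the trace\n",
    "2. The full trace output returned to the user\n",
    "\n",
    "##\n",
    "User Input:\n",
    "{input}\n",
    "\n",
    "Trace Output:\n",
    "{output}\n",
    "##\n",
    "\n",
    "Respond with exactly one word: `clear` or `unclear`.\n",
    "1. `clear` → The response is well organized, easy to read, and directly addresses the user's request.\n",
    "2. `unclear` → The response is confusing, poorly organized, or hard to follow.\n",
    "\"\"\""
   ]
  },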
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce8d7f1b",
   "metadata": {
    "id": "ce8d7f1b"
   },
   "outputs": [],
   "source": [
    "from phoenix.evals import LLM, async_evaluate_dataframe, create_classifier\n",
    "\n",
    "llm = LLM(provider=\"openai\", model=\"gpt-4o-mini\")\n",
    "\n",
    "\n",
    "tone_evaluator = create_classifier(\n",
    "    name=\"tool calling\",\n",
    "    llm=llm,\n",
    "    prompt_template=TOOL_CALLING_ORDER,\n",
    "    choices={\"correct\": 1.0, \"incorrect\": 0.0},\n",
    ")\n",
    "\n",
    "relevance_evaluator = create_classifier(\n",
    "    name=\"relevance\",\n",
    "    llm=llm,\n",
    "    prompt_template=RECOMMENDATION_RELEVANCE,\n",
    "    choices={\"correct\": 1.0, \"incorrect\": 0.0},\n",
    ")\n",
    "\n",
    "\n",
    "results_df = await async_evaluate_dataframe(\n",
    "    dataframe=trace_df,\n",
    "    evaluators=[tone_evaluator, relevance_evaluator],\n",
    ")\n",
    "results_df.head()"
   ]
  },
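  {
   "cell_type": "markdown",
   "id": "inspect-results-md",
   "metadata": {
    "id": "inspect-results-md"
   },
   "source": [
    "Optionally, you can compute quick aggregate scores before logging. The exact score columns that `async_evaluate_dataframe` adds can vary by `phoenix-evals` version, so this sketch discovers numeric score columns dynamically rather than assuming their names.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "inspect-results-code",
   "metadata": {
    "id": "inspect-results-code"
   },
   "outputs": [],
   "source": [
    "# Optional sanity check: average each numeric score column across traces.\n",
    "# Score column names vary by phoenix-evals version, so detect them instead of hard-coding.\n",
    "score_columns = [\n",
    "    col\n",
    "    for col in results_df.columns\n",
    "    if \"score\" in col.lower() and pd.api.types.is_numeric_dtype(results_df[col])\n",
    "]\n",
    "if score_columns:\n",
    "    print(results_df[score_columns].mean())\n",
    "else:\n",
    "    print(results_df.columns.tolist())"
   ]
  },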
  {
   "cell_type": "markdown",
   "id": "4613e975",
   "metadata": {
    "id": "4613e975"
   },
   "source": [
    "# Log Results Back to Phoenix\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afd4401d",
   "metadata": {
    "id": "afd4401d"
   },
   "source": [
    "The final step is to log our results back to Arize. After running the cell below, you’ll be able to view your trace-level evaluations on the platform, complete with relevant labels, scores, and explanations.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "IF25CDHBcxPL",
   "metadata": {
    "id": "IF25CDHBcxPL"
   },
   "outputs": [],
   "source": [
    "from phoenix.evals.utils import to_annotation_dataframe\n",
    "\n",
    "root_spans = primary_df[primary_df[\"parent_id\"].isna()][[\"context.trace_id\", \"context.span_id\"]]\n",
    "\n",
    "# Merge results with root spans to align on trace_id\n",
    "results_with_spans = pd.merge(\n",
    "    results_df.reset_index(), root_spans, on=\"context.trace_id\", how=\"left\"\n",
    ").set_index(\"context.span_id\", drop=False)\n",
    "\n",
    "# Format for Phoenix logging\n",
    "annotation_df = to_annotation_dataframe(dataframe=results_with_spans)\n",
    "\n",
    "await px_client.spans.log_span_annotations_dataframe(\n",
    "    dataframe=annotation_df,\n",
    "    annotator_kind=\"LLM\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "xVG-DsKlWb6c",
   "metadata": {
    "id": "xVG-DsKlWb6c"
   },
   "source": [
    "![Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/trace_level_evals_phoenix.png)\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
